[% setvar title Regex: Support for incremental pattern matching %]
<div id="archive-notice">
    <h3>This file is part of the Perl 6 Archive</h3>
    <p>To see what is currently happening visit <a href="http://www.perl6.org/">http://www.perl6.org/</a></p>
</div>
<div class='pod'>
<a name='TITLE'></a><h1>TITLE</h1>
<p>Regex: Support for incremental pattern matching</p>
<a name='VERSION'></a><h1>VERSION</h1>
<pre>  Maintainer: Damian Conway &lt;<a href='mailto:damian@conway.org'>damian@conway.org</a>&gt;
  Date: 11 Aug 2000
  Last Modified: 18 Sep 2000
  Number: 93
  Version: 3
  Mailing List: <a href='mailto:perl6-language-regex@perl.org'>perl6-language-regex@perl.org</a>
  Status: Frozen</pre>
<a name='ABSTRACT'></a><h1>ABSTRACT</h1>
<p>This RFC proposes that, in addition to strings, subroutine references may be
bound (with =~ or !~ or implicitly) to a regular expression.</p>
<a name='DESCRIPTION'></a><h1>DESCRIPTION</h1>
<p>It is proposed that the Perl 6 regular expression engine be extended to
allow a regex to match against an incremental character source, rather than
only against a fixed string.</p>
<p>Specifically, it is proposed that a subroutine reference could be bound
to a regular expression:</p>
<pre>	sub {...} =~ /pattern/;</pre>
<p>As the regular expression is matched, it would make calls to the subroutine
to request additional characters to match, or (after it has matched) to
return any unused characters.</p>
<p>When the regex engine requires additional characters to match, the
subroutine would be called with a single argument, and would be expected
to return a character string containing the extra characters. The single
argument would specify how many characters should be returned (typically
this would be 1, unless internal analysis by the regex engine can deduce
that more than one character will be required). Returning fewer than the
requested number of characters would typically indicate a premature
end-of-string and would probably trigger backtracking and/or failure to
match.</p>
<p>When the match is finished, the subroutine would be called one final time,
and passed two arguments: a string containing the &quot;unused&quot; characters (what
would be $' for a fixed string), and a flag set to 1. The subroutine
could use this call to push back (or cache) the unused data. In the case of
a failure to match (or success of the !~ operator), every character requested
during the match would be sent back.</p>
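<p>The calling protocol described above can be emulated in present-day Perl 5 to see how the two kinds of call fit together. The following is only a rough sketch under stated assumptions: <code>make_source</code> and <code>match_incremental</code> are hypothetical names, the chunk size is passed in by hand, and the driver simply retries the pattern on a growing buffer, which a real engine would not need to do. In particular it stops at the first buffer in which the pattern matches, so a greedy quantifier may match less than it would against a fixed string.</p>

```perl
use strict;
use warnings;

# Hypothetical character source over a fixed string, following the proposed
# protocol: ($count) asks for characters, ($chars, 1) hands characters back.
sub make_source {
    my ($data) = @_;
    return sub {
        my ($arg, $putback) = @_;
        if ($putback) {                      # "putback unused data" request
            $data = $arg . $data;
            return;
        }
        return substr($data, 0, $arg, "");   # "send more data" request
    };
}

# Crude stand-in for the proposed engine behaviour (an assumption, not the
# RFC's design): pull $chunk characters at a time and retry the pattern
# until it matches or the source runs dry.
sub match_incremental {
    my ($source, $regex, $chunk) = @_;
    my $buffer    = "";
    my $exhausted = 0;
    while (1) {
        if ($buffer =~ $regex) {
            my ($matched, $unused) = ($&, $');
            $source->($unused, 1);           # success: return the unused tail
            return $matched;
        }
        if ($exhausted) {
            $source->($buffer, 1);           # failure: return every character
            return;
        }
        my $more = $source->($chunk);
        $exhausted = 1 if length($more) < $chunk;   # short read means end of input
        $buffer .= $more;
    }
}

my $source = make_source("count=42;rest");
my $hit    = match_incremental($source, qr/\d\d;/, 4);
my $left   = $source->(4);                   # sees the characters that were put back
printf "matched %s, next request returned %s\n", $hit, $left;
```

<p>Here the third read pulls in <code>;res</code>, the pattern matches <code>42;</code>, and the three unused characters are handed back, so the next request for four characters returns <code>rest</code>.</p>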
<p>A typical structure for a subroutine against which a regex was matched
would therefore be:</p>
<pre>	sub char_source {
	    if ($_[1]) {		# &quot;putback unused data&quot; request
	        recache($_[0]);
	    }
	    else {			# &quot;send more data&quot; request
	        return get_chars(max=&gt;$_[0]);
	    }
	}</pre>
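<p>To make that skeleton concrete, the two helpers it calls can be filled in over a plain string buffer. This is a minimal Perl 5 sketch, not part of the proposal: <code>$pending</code>, the bodies of <code>get_chars</code> and <code>recache</code>, and the name <code>char_source</code> are all assumptions chosen for illustration, and the calls at the end play the part of the regex engine by hand.</p>

```perl
use strict;
use warnings;

my $pending = "hello world";   # hypothetical backing store for the example

# Hypothetical helper: hand out at most $opt{max} characters from the store.
sub get_chars {
    my (%opt) = @_;
    return substr($pending, 0, $opt{max}, "");
}

# Hypothetical helper: put unused characters back at the front of the store.
sub recache {
    my ($chars) = @_;
    $pending = $chars . $pending;
    return;
}

sub char_source {
    my ($arg, $putback) = @_;
    if ($putback) {                     # "putback unused data" request
        recache($arg);
        return;
    }
    return get_chars(max => $arg);      # "send more data" request
}

my $first = char_source(5);             # the engine asks for five characters
char_source("lo", 1);                   # and hands back two it did not use
my $again = char_source(8);             # the next request sees them again
print "$first|$again\n";                # prints "hello|lo world"
```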
<a name='Examples'></a><h2>Examples</h2>
<p>The most obvious example would be matching against an input stream:</p>
<pre>	sub { $_[1] ? $fh-&gt;pushback($_[0]) : $fh-&gt;getn($_[0]) } =~ /pat/;</pre>
<p>which could also be written:</p>
<pre>	(^1 ? $fh-&gt;pushback(^0) : $fh-&gt;getn(^0)) =~ /pat/;</pre>
<p>Of course, it would often be useful to have a subroutine that returns a
closure on a particular filehandle:</p>
<pre>	sub fhmatch { ^1 ? $_[0]-&gt;pushback(^0) : $_[0]-&gt;getn(^0) }

	fhmatch($fh)     =~ /pat/
	fhmatch(\*STDIN) =~ /pat/
	# etc.</pre>
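<p>For comparison, a similar closure factory can be written in Perl 5 today, although it cannot be bound to a regex. The sketch below is an assumption-laden illustration: <code>fh_source</code> is a hypothetical name, and because standard Perl 5 filehandles offer no multi-character pushback, the closure keeps unused text in its own <code>$held</code> buffer and fetches the rest with the built-in <code>read</code>.</p>

```perl
use strict;
use warnings;

# Hypothetical factory: wrap a filehandle in a source closure that follows
# the proposed two-call protocol.
sub fh_source {
    my ($fh) = @_;
    my $held = "";                            # characters that were put back
    return sub {
        my ($arg, $putback) = @_;
        if ($putback) {                       # putback request
            $held = $arg . $held;
            return;
        }
        my $want = $arg - length($held);      # characters still missing
        if ($want > 0) {
            my $got = read($fh, my $chunk, $want);
            $held .= $chunk if $got;          # undef (error) or 0 (EOF) adds nothing
        }
        return substr($held, 0, $arg, "");
    };
}

my $text = "key: value\n";
open(my $fh, '<', \$text) or die "cannot open in-memory handle: $!";
my $source = fh_source($fh);

my $head = $source->(4);     # "key:"
$source->(":", 1);           # pretend the colon was not consumed
my $next = $source->(3);     # ": v"
print "$head / $next\n";
```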
<p>In fact, this might be so commonly useful that matching against a
filehandle should be made to work directly. That is:</p>
<pre>	$fh =~ /pat/
	\*STDIN =~ /pat/
	</pre>
<p>One could then do interactive lexing cleanly:</p>
<pre>	until (eof $fh) {
		switch ($fh) {
			/^\s*/;			# skip leading whitespace
			case /^(lexeme1)/	{ push @tokens, $1=&gt;LEX1 }
			case /^(lexeme2)/	{ interact_somehow }
			case /^(lexeme3)/	{ push @tokens, $1=&gt;LEX3 }
			# etc.
		}
	}</pre>
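<p>Without the proposed binding, the closest Perl 5 approximation is to read the handle a line at a time and lex each line with <code>\G</code> and the <code>/gc</code> modifiers, so that input is still consumed only as it is needed. The sketch below makes several assumptions: <code>lex_handle</code>, the three token rules, and the token names are hypothetical, and tokens are stored as two-element array references in place of the proposed pairs.</p>

```perl
use strict;
use warnings;

# Hypothetical token rules, tried in order at the current position.
my @rules = (
    [ NUMBER => qr{\G(\d+)}       ],
    [ WORD   => qr{\G([A-Za-z]+)} ],
    [ OP     => qr{\G([-+*/=])}   ],
);

# Lex a handle one line at a time; each token is [ text, type ].
sub lex_handle {
    my ($fh) = @_;
    my @tokens;
    while (defined(my $line = <$fh>)) {
        LEXEME: until ($line =~ /\G\s*\z/gc) {        # stop at end of line
            $line =~ /\G\s+/gc;                       # skip leading whitespace
            for my $rule (@rules) {
                my ($type, $pattern) = @$rule;
                if ($line =~ /$pattern/gc) {
                    push @tokens, [ $1 => $type ];
                    next LEXEME;
                }
            }
            die "unrecognised input at: " . substr($line, pos($line) // 0);
        }
    }
    return @tokens;
}

my $input = "x = 10\ny = x + 32\n";
open(my $fh, '<', \$input) or die "cannot open in-memory handle: $!";
my @tokens = lex_handle($fh);
print join(" ", map { "$_->[1]($_->[0])" } @tokens), "\n";
```

<p>The <code>/c</code> modifier keeps the match position when a rule fails, which is what lets the next rule try again at the same place.</p>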
<p>Note the use of the proposed PAIR data structure to store tokens
in the above example.</p>
<p>Because the character source is a subroutine, one could also match against
data coming out of a socket:</p>
<pre>	my $cache = &quot;&quot;;

	sub matching_socks {
		if ($_[1]) { $cache = $_[0] . $cache; return }	# putback at front of cache
		if (length($cache) &lt; $_[0]) {		# not enough cached
			my $extra;			# so get some more
			recv(SOCKET, $extra, $_[0]-length($cache), 0);
			$cache .= $extra;
		}
		return substr($cache,0,$_[0],&quot;&quot;);
	}

	switch (\&amp;matching_socks) {
		case /pat1/ { action1() }
		case /pat2/ { action2() }
		case /pat3/ { action3() }
		#etc.
	}</pre>
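<p>The socket-backed source can be exercised in Perl 5 without a network peer by using a local socket pair. This is a sketch under assumptions, not a tested network client: <code>socket_source</code>, <code>$near</code> and <code>$far</code> are hypothetical names, <code>socketpair</code> with <code>AF_UNIX</code> is assumed to be available on the platform, and the calls at the end stand in for requests the regex engine would make.</p>

```perl
use strict;
use warnings;
use Socket qw(AF_UNIX SOCK_STREAM PF_UNSPEC);

# A connected pair of local sockets stands in for a real network peer.
socketpair(my $near, my $far, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair failed: $!";

my $cache = "";

sub socket_source {
    my ($arg, $putback) = @_;
    if ($putback) {                              # putback: restore unused text
        $cache = $arg . $cache;                  # in front of anything cached
        return;
    }
    my $want = $arg - length($cache);
    if ($want > 0) {                             # not enough cached
        my $extra = "";
        defined(recv($near, $extra, $want, 0))   # recv needs a FLAGS argument
            or die "recv failed: $!";
        $cache .= $extra;
    }
    return substr($cache, 0, $arg, "");
}

defined(send($far, "GET /index", 0)) or die "send failed: $!";

my $verb = socket_source(3);     # "GET"
socket_source("T", 1);           # hand one character back
my $rest = socket_source(5);     # "T /in"
print "$verb then $rest\n";
```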
<p>or any other source:</p>
<pre>	sub mega_ape {
		return join &quot;&quot;, map {['a'..'z',(' ')x6]-&gt;[rand 32]} (1..$_[0])
			unless $_[1]
	}

	\&amp;mega_ape =~ /Now is the Winter of our discontent.../i;

	print &quot;Art after &quot;, length($`), &quot; chars\n&quot;;</pre>
<a name='IMPLEMENTATION'></a><h1>IMPLEMENTATION</h1>
<p>Dammit, Jim, I'm a doctor, not a magician!</p>
<p>Probably needs to be integrated with IO disciplines too.</p>
<a name='REFERENCES'></a><h1>REFERENCES</h1>
<p>RFC 22: Builtin switch statement</p>
<p>RFC 23: Higher order functions</p>
<p>RFC 84: Replace =&gt; (stringifying comma) with =&gt; (pair constructor)</p>
</div>
